Notes on Importin (1QGK)

Importin is significant in that it was the largest protein to be optimized; it contains 876 residues and has the empirical formula C4513H7337N1229O1438S49, resulting in a total of 14,566 atoms.
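
The atom total can be checked directly from the empirical formula. The following is a small Python sketch of that arithmetic; the helper name count_atoms is illustrative only and is not part of MOPAC.

```python
import re

def count_atoms(formula: str) -> dict:
    """Return {element: count} for a simple formula such as C4513H7337N1229O1438S49."""
    counts = {}
    # Each match is an element symbol followed by an optional integer count.
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts

importin = count_atoms("C4513H7337N1229O1438S49")
total_atoms = sum(importin.values())
print(total_atoms)  # 14566
```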

Preparing the MOPAC data-set required only a minimum of operations. Starting with the PDB file downloaded from the Protein Data Bank, the first step was to add hydrogen atoms and then form salt bridges, using the keywords ADD-H and SITE=(SALT). The resulting system was then optimized using the following set of keywords:

CHARGE=0 GRADIENTS DUMP=2D OUTPUT HTML THREADS=1 MOZYME EPS=78.4 PM6-D3H4 T=2W

Each keyword was considered necessary for the following reasons:

Keyword     Reason
CHARGE=0    Specifying the charge acts as a check: if a geometric fault changed the charge of the system, the error would be detected when the job was restarted, and the job would be stopped.
GRADIENTS   To show that the geometry was, in fact, near the bottom of a potential well. A large final gradient would have indicated that the optimization was incomplete.
DUMP=2D     This job was run on a Mac along with 23 other jobs. There is a fault in macOS that can cause a crash if the disk is accessed frequently. These crashes were avoided by writing restart files every two days instead of every two hours.
OUTPUT      Output file sizes can be minimized by not printing some geometries. Only essential geometries are printed when OUTPUT is used.
HTML        This causes a simple web-page and a PDB file of the final geometry to be generated.
THREADS=1   Restricts the job to a single thread. This reduced overhead while 23 other jobs were also running, and resulted in the machine spending over 98% of its CPU time working on MOPAC jobs.
MOZYME      Essential for large systems, i.e., those with more than about 600 atoms.
EPS=78.4    The system is modeled in aqueous media, using the COSMO implicit solvation method. EPS sets the dielectric constant, here that of water.
PM6-D3H4    This is the method of choice for modeling protein systems.
T=2W        Each job was allowed to run for a maximum of two weeks.
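
To make the time-valued keywords concrete, the sketch below converts DUMP=2D and T=2W to seconds. It assumes the unit suffixes M, H, D and W mean minutes, hours, days and weeks, and that a bare number means seconds; the helper name to_seconds is hypothetical and not part of MOPAC.

```python
# Assumed meaning of the unit suffixes on time-valued keywords.
UNIT_SECONDS = {"M": 60, "H": 3600, "D": 86400, "W": 604800}

def to_seconds(value: str) -> float:
    """Convert a value such as '2D' or '2W' to seconds (bare numbers are seconds)."""
    value = value.strip().upper()
    if value and value[-1] in UNIT_SECONDS:
        return float(value[:-1]) * UNIT_SECONDS[value[-1]]
    return float(value)

keywords = "CHARGE=0 GRADIENTS DUMP=2D OUTPUT HTML THREADS=1 MOZYME EPS=78.4 PM6-D3H4 T=2W"

# Keep only the KEY=VALUE keywords; bare keywords such as MOZYME are skipped.
settings = dict(word.split("=", 1) for word in keywords.split() if "=" in word)

dump_interval = to_seconds(settings["DUMP"])  # seconds between restart-file dumps
time_limit = to_seconds(settings["T"])        # maximum run time in seconds
print(dump_interval, time_limit, time_limit / dump_interval)
```

With these settings the two-week limit spans seven two-day dump intervals, so disk writes for restart files are rare over the life of the job.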

All 43,698 geometric parameters were flagged for optimization, and a keyword that is normally used when modeling proteins, CUTOFF, was not used in this optimization. The resulting job thus represents the most demanding calculation that could be run.
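
The parameter count is simply three coordinates per atom, as this short check shows (the variable names are illustrative):

```python
n_atoms = 14566             # total atoms in the hydrogenated system
n_parameters = 3 * n_atoms  # three coordinates per atom, all flagged for optimization
print(n_parameters)  # 43698
```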

The first job used all two weeks that were allocated, and terminated when it ran out of time.  The job was then restarted using the original keywords plus the extra keyword RESTART.  This job ran to completion.  Together, the two jobs used about 20 CPU days.
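
As a minimal sketch of the restart step, the keyword line for the second job is the original line with RESTART appended. The helper with_restart is hypothetical; it only manipulates the keyword string and does not touch any MOPAC files.

```python
def with_restart(keyword_line: str) -> str:
    """Return the keyword line with RESTART appended, unless it is already present."""
    words = keyword_line.split()
    if "RESTART" not in (word.upper() for word in words):
        words.append("RESTART")
    return " ".join(words)

original = "CHARGE=0 GRADIENTS DUMP=2D OUTPUT HTML THREADS=1 MOZYME EPS=78.4 PM6-D3H4 T=2W"
restart_line = with_restart(original)
print(restart_line)
```

Applying the helper a second time leaves the line unchanged, so it is safe to use if a job has to be restarted more than once.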

Because of the large computational effort, only the PM6-D3H4 PDB and unconstrained calculations were run; these produced heats of formation of -79946.7 and -81302.0 kcal/mol, respectively. The two constrained calculations were run using the old options, and should be ignored.
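
For scale, the difference between the two reported heats of formation can be worked out as follows. This is only arithmetic on the values quoted above; the per-residue figure is a rough average and the variable names are illustrative.

```python
hof_pdb = -79946.7            # kcal/mol, PDB calculation
hof_unconstrained = -81302.0  # kcal/mol, unconstrained calculation
n_residues = 876

delta = hof_unconstrained - hof_pdb  # change in heat of formation
per_residue = delta / n_residues     # average change per residue
print(f"{delta:.1f} kcal/mol total, {per_residue:.2f} kcal/mol per residue")
```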

The purpose of this calculation was to demonstrate that quite large proteins can be modeled. This is not to suggest that they should be modeled: any attempt to do real research work using large proteins would be extremely tedious and would require extraordinary care, because even one simple mistake could invalidate a three-CPU-week simulation and result in severe frustration. Instead, the calculation is intended to show that the method can be used with confidence in computational research when smaller proteins, of up to about 7000 to 8000 atoms, are involved.